feat telegram: add voice message support for telegram with pluggable #38
Conversation
We require contributors to sign our Contributor License Agreement, and we don't have @soufianebouaddis on file. In order for us to review and merge your code, please create a PR where you add yourself to the contributors of JobRunr. This only needs to be done once. As soon as that is done, we can review your PR. Thanks a lot!
@cla-bot check
The cla-bot has been summoned, and re-checked this pull request!
…tests for the audit/executions/deliveries REST API endpoints (T58, T60) against a real PostgreSQL via Testcontainers; all tests green, BUILD SUCCESS
…migration for role_agent_config, RoleAgentConfig entity/repository/service with hierarchy fallback and caching, model override integrated through the whole pipeline (ChatRestController→SseStreamingService→ChatService), REST API endpoints for management, 26 new tests, all 1147 tests green
Hi @soufianebouaddis, thanks for submitting this PR. Sorry for the late review. From what I can see, the actual transcription is yet to be done. I think we should have at least one working implementation. Is this something you'd still like to work on? My second concern is that we're mixing telegram text and voice messages: is it possible to find a nice abstraction?
Hi @auloin, thanks for the review and sorry for the incomplete implementation. I'll continue working on this and add a concrete transcription provider so the feature works end-to-end. I'll also revisit the current design to avoid mixing text and voice handling in TelegramChannel and introduce a cleaner abstraction for message types. I'll update the PR shortly with these changes. Thanks for the feedback!
Hi @auloin, this update extends TelegramChannel.consume() to handle both text and voice inputs through a single flow. Voice messages are downloaded via TelegramVoiceDownloader, transcribed to text using a SpeechToTextService abstraction, and then passed to agent.respondTo() the same way as text messages. I added working transcription implementations (local via whisper-cli + ffmpeg, and OpenAI), with a mock still available for testing. The flow normalizes everything to text before reaching the agent, so text and voice are no longer mixed beyond the input layer.
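For reference, the normalized flow described here could be sketched roughly as below. Only consume(), respondTo(), and the SpeechToTextService abstraction come from this PR; the Update record, the downloader lambda, and the field names are illustrative stand-ins, not the actual JobRunr code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: voice is transcribed to text at the input layer,
// so text and voice share one path into the agent.
public class VoiceFlowSketch {

    interface SpeechToTextService {
        String transcribe(byte[] audio);
    }

    // Stand-in for a Telegram update carrying either text or voice bytes.
    record Update(String text, byte[] voice) {
        boolean hasVoice() { return voice != null; }
    }

    static class Agent {
        String lastPrompt;
        void respondTo(String prompt) { lastPrompt = prompt; }
    }

    static class TelegramChannel {
        final SpeechToTextService speechToText;
        final Agent agent;

        TelegramChannel(SpeechToTextService speechToText, Agent agent) {
            this.speechToText = speechToText;
            this.agent = agent;
        }

        void consume(Update update) {
            // Normalize: voice becomes text before reaching the agent.
            String text = update.hasVoice()
                    ? speechToText.transcribe(update.voice())
                    : update.text();
            agent.respondTo(text);
        }
    }

    public static void main(String[] args) {
        Agent agent = new Agent();
        // Mock transcription: pretend the audio decodes to its UTF-8 text.
        TelegramChannel channel = new TelegramChannel(
                audio -> new String(audio, StandardCharsets.UTF_8), agent);
        channel.consume(new Update(null,
                "hello from voice".getBytes(StandardCharsets.UTF_8)));
        System.out.println(agent.lastPrompt); // prints "hello from voice"
    }
}
```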
Thanks @soufianebouaddis. I'll review it as soon as possible. In the meantime could you already pull the main branch into your branch and solve the conflicts?
Hi @auloin, thanks for the heads up. I'll pull the latest changes from main and resolve the conflicts.
Force-pushed 33d5bf0 to 21404e0
Hi @auloin, here are the relevant logs:

Exception in thread "pool-4-thread-1" java.lang.RuntimeException: Failed to send both HTML and fallback messages

Would you prefer that I implement message chunking to split long responses, or should we explore another approach?
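For reference, the message-chunking idea raised here could look something like the following sketch. This helper is hypothetical, not part of the PR; 4096 is Telegram's per-message character limit as stated elsewhere in the thread, and the split is deliberately naive (fixed-size slices, no attempt to preserve word or HTML-tag boundaries).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that splits a long agent reply into
// Telegram-sized pieces to avoid the "message is too long" error.
public class TelegramChunker {
    static final int TELEGRAM_MAX = 4096; // Telegram's message length limit

    static List<String> chunk(String text, int max) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i += max) {
            parts.add(text.substring(i, Math.min(text.length(), i + max)));
        }
        if (parts.isEmpty()) {
            parts.add(""); // keep behavior defined for empty replies
        }
        return parts;
    }
}
```

Each chunk would then be sent as a separate sendMessage call; a real implementation would probably want to split on whitespace or paragraph boundaries instead of mid-word.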
Interesting finding @soufianebouaddis. Does it happen often? Is it a blocker? I wonder if it can be a separate task so we can keep the scope of this PR small. |
auloin
left a comment
Thanks again for the work on this @soufianebouaddis. I think we're pretty close; I have a few remarks to see if we can simplify the implementation a bit.
```java
@Service
@ConditionalOnProperty(name = "speech.provider", havingValue = "whisper-cpp")
public class WhisperCppSpeechToTextService implements SpeechToTextService {
```
I've been looking for a java library that does speech to text and I found vosk: https://github.com/alphacep/vosk-api. If it works, what do you think of making it the default @soufianebouaddis? We could also drop this implementation which requires having both ffmpeg and whisper-cli.
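For context, the whisper-cpp provider under discussion shells out to the two external tools named above. A sketch of the command lines it would presumably build is below; the exact flags and binary names are assumptions based on common ffmpeg and whisper.cpp usage, not the PR's actual code.

```java
import java.util.List;

// Hypothetical command builders for the external-tool pipeline:
// ffmpeg converts Telegram's OGG voice note to 16 kHz mono WAV,
// then whisper-cli transcribes the WAV.
public class WhisperCppCommands {

    static List<String> ffmpegConvert(String oggPath, String wavPath) {
        // -ar 16000 / -ac 1: whisper models expect 16 kHz mono PCM;
        // -y overwrites any stale output file.
        return List.of("ffmpeg", "-y", "-i", oggPath,
                "-ar", "16000", "-ac", "1", wavPath);
    }

    static List<String> whisperTranscribe(String modelPath, String wavPath) {
        // --no-timestamps keeps the output as plain transcribed text.
        return List.of("whisper-cli", "-m", modelPath,
                "-f", wavPath, "--no-timestamps");
    }
}
```

Each list would be handed to a ProcessBuilder; this double external dependency is exactly what switching to a pure-Java library like Vosk would remove.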
…/speech/MockSpeechToTextService and refactor OpenAiSpeechToTextService to delegate it to SpringAI
Hi @auloin, thanks again for the review. I pushed a new revision addressing the first two remarks: MockSpeechToTextService has been moved out of production code into src/test, since keeping it as a default runtime fallback was indeed misleading. I also spent some time looking into Vosk as an alternative to WhisperCppSpeechToTextService. It is definitely attractive from a portability standpoint since it removes the whisper-cli dependency, but Telegram voice messages still come in OGG format and Vosk expects WAV input, so an audio conversion step is still required unless we introduce an additional Java decoder. As for the "message is too long" exception I hit during testing: it does not happen often, and it is not related to voice handling itself. It only occurs when the generated agent reply exceeds Telegram's 4096-character limit, which can happen with regular text messages as well.
Add support for voice messages in Telegram channel

This PR extends TelegramChannel to handle voice messages in addition to text.

Changes
- Extended TelegramChannel.consume() to process both text and voice messages
- Introduced a SpeechToTextService abstraction
- Transcribed voice messages go through the same agent.respondTo(...) flow as text

Transcription
- MockSpeechToTextService (no external dependency, suitable for testing)
- OpenAiSpeechToTextService (enabled via speech.provider=openai)

Notes

Next Steps / Ideas
- Additional transcription providers (Spring AI AudioTranscriptionModel, or local Whisper plugin)
- Could move to Spring AI's AudioTranscriptionModel if preferred
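Based on the @ConditionalOnProperty bindings quoted in the review thread, provider selection is presumably driven by a single property; the snippet below is an illustrative application.properties fragment, with property name and values taken from the PR, not a verified copy of its configuration.

```properties
# Select the transcription backend (values from the PR's @ConditionalOnProperty bindings)
speech.provider=openai
# speech.provider=whisper-cpp   # local whisper-cli + ffmpeg implementation
```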